fix: EAGLE mix_hidden_states in-place op crash (#1088) #1104

javierdejesusda wants to merge 2 commits into NVIDIA:main from
Conversation
📝 Walkthrough

A fix prevents autograd crashes when EAGLE's mix hidden states mode is enabled by cloning a tensor before in-place indexing. A regression test validates that gradients flow correctly through the EAGLE module during training with mix hidden states enabled.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 5 passed
🧹 Nitpick comments (1)
tests/unit/torch/speculative/plugins/test_hf_speculative.py (1)
79-80: Consider making the test deterministic. A fixed seed before random input generation can make CI failures easier to reproduce.
♻️ Optional tweak
```diff
 def test_eagle_mix_hidden_states_backward(eagle_config, eagle_ttt_steps):
@@
-    input_ids = torch.randint(0, model.config.vocab_size, (2, 16))
+    torch.manual_seed(0)
+    input_ids = torch.randint(0, model.config.vocab_size, (2, 16))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/torch/speculative/plugins/test_hf_speculative.py` around lines 79-80: The test uses nondeterministic inputs via torch.randint for input_ids/labels; make it deterministic by setting a fixed RNG seed before generating those tensors (e.g., call torch.manual_seed with a constant value, and if GPU tensors may be used also torch.cuda.manual_seed_all) so that input_ids and labels in test_hf_speculative.py are reproducible across CI runs; place the seed call immediately before the torch.randint line that creates input_ids to ensure determinism.
ℹ️ Review info
⚙️ Run configuration
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4d359ff0-2b8f-4ce2-baa9-bc9491e18455
📒 Files selected for processing (2)
modelopt/torch/speculative/plugins/transformers.py
tests/unit/torch/speculative/plugins/test_hf_speculative.py
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1104      +/-   ##
==========================================
+ Coverage   72.74%   75.58%    +2.84%
==========================================
  Files         459      459
  Lines       48611    48612        +1
==========================================
+ Hits        35361    36745     +1384
+ Misses      13250    11867     -1383
```
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Clone eagle_input_hiddens before indexed assignment to avoid in-place modification of a tensor in the autograd graph, which causes RuntimeError during backward pass. Mirrors the existing fix in the Megatron backend (megatron_eagle.py:1201-1202). Add regression test parametrized over eagle_ttt_steps [1, 2]. Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com>
d963bfc to 9d8f32f
/ok to test 9d8f32f
Type of change
Description
Fixes #1088 — `RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: IndexPutBackward0` when training with `eagle_mix_hidden_states=True`.

Root cause: In `HFEagleModel._eagle_training_forward`, the indexed assignment at lines 991–994 modifies `eagle_input_hiddens` in-place while it is still part of the autograd computation graph.

Fix: Clone the tensor before the in-place assignment. This is the same pattern already used in the Megatron backend at `megatron_eagle.py:1201-1202`; the HF backend was missing this clone.
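The failure mode and the clone fix can be reproduced outside the model in a few lines of plain PyTorch (a minimal sketch — the tensor names are illustrative, not the actual `HFEagleModel` code):

```python
import torch

x = torch.randn(2, 4, requires_grad=True)

# Broken pattern: `**` saves its input for the backward pass, then the
# indexed assignment mutates that saved tensor in place (the IndexPut
# that autograd reports in the error message).
h = x * 1.0        # non-leaf tensor inside the autograd graph
y = h ** 2         # backward of ** needs the original h
h[:, 0] = 0.0      # in-place write bumps h's version counter
try:
    y.sum().backward()
except RuntimeError as e:
    print("broken:", e)  # "... modified by an inplace operation ..."

# Fixed pattern: clone before the indexed assignment, as this PR does
# for eagle_input_hiddens.
h = x * 1.0
y = h ** 2
h = h.clone()      # the write now targets a fresh tensor
h[:, 0] = 0.0      # the tensor saved by ** is untouched
(y + h).sum().backward()
print("fixed: backward succeeded")
```

The clone costs one extra copy of the tensor but keeps the saved activation intact, which is why both backends use the same pattern.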
Usage
Testing
Added `test_eagle_mix_hidden_states_backward`, parametrized over `eagle_ttt_steps` [1, 2], that:
- runs a training forward pass with `eagle_mix_hidden_states=True`
- backpropagates and checks that gradients flow through the `eagle_module`

Checklist
Summary by CodeRabbit

Bug Fixes
- Prevented autograd crashes during training when EAGLE's mix hidden states mode is enabled.

Tests
- Added a regression test validating that gradients flow correctly through the EAGLE module with mix hidden states enabled.
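The shape of the regression test described above can be sketched with a toy module (hedged: `ToyEagle` is a stand-in for the real `eagle_module`, not modelopt code — the point is the forward-with-indexed-mixing, backward, then grad-presence check):

```python
import torch
import torch.nn as nn

class ToyEagle(nn.Module):
    """Stand-in for eagle_module: mixes hidden states with embeddings."""
    def __init__(self, dim: int = 8):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, hiddens: torch.Tensor, embeds: torch.Tensor) -> torch.Tensor:
        mixed = hiddens.clone()    # clone before the indexed write (the fix)
        mixed[:, 0] = embeds[:, 0]
        return self.proj(mixed)

torch.manual_seed(0)               # deterministic, per the review nitpick
model = ToyEagle()
hiddens = torch.randn(2, 16, 8, requires_grad=True)
embeds = torch.randn(2, 16, 8)

loss = model(hiddens, embeds).pow(2).mean()
loss.backward()                    # must not raise the IndexPutBackward0 error
assert model.proj.weight.grad is not None  # gradients reached the module
assert hiddens.grad is not None            # and the input hidden states
```

Without the `clone()` call, writing into a tensor that an earlier op saved for backward would trigger the same `RuntimeError` this PR fixes, which is exactly what the regression test guards against.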